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Abstract 

In many real- world applications data come as discrete metric spaces sampled around 1 -dimensional 
filamentary structures that can be seen as metric graphs. In this paper we address the metric recon- 
struction problem of such filamentary structures from data sampled around them. We prove that they 
can be approximated, with respect to the Gromov-Hausdorff distance by well-chosen Reeb graphs 
(and some of their variants) and we provide an efficient and easy to implement algorithm to compute 
such approximations in almost linear time. We illustrate the performances of our algorithm on a few 
synthetic and real data sets. 

1 Introduction 

Motivation. With the advance of sensor technology, computing power and Internet, massive amounts 
of geometric data are being generated and collected in various areas of science, engineering and business. 
As they are becoming widely available, there is a real need to analyze and visualize these large scale 
geometric data to extract useful information out of them. In many cases these data are not embedded in 
Euclidean spaces and come as (finite) sets of points with pairwise distances information, i.e. (discrete) 
metric spaces. A large amount of research has been done on dimensionality reduction, manifold learning 
and geometric inference for data embedded in, possibly high dimensional, Euclidean spaces and assumed 
to be concentrated around low dimensional manifolds [5, 29, 34]. However, the assumption of data lying 
on a manifold may fail in many applications. In addition, the strategy of representing data by points 
in Euclidean space may introduce large metric distortions as the data may lie in highly curved spaces, 
instead of in flat Euclidean space raising many difficulties in the analysis of metric data. In the past 
decade, with the development of topological methods in data analysis, new theories such as topological 
persistence (see, for example, [21, 36, 8, 10]) and new tools such as the Mapper algorithm [33] have 
given rise to new algorithms to extract and visualize geometric and topological information from metric 
data without the need of an embedding into an Euclidean space. In this paper we focus on a simple 
but important setting where the underlying geometric structure approximating the data can be seen as 
a branching filamentary structure i.e., more precisely, as a metric graph which is a topological graph 
endowed with a length assigned to each edge (see Section 2). Such structures appear naturally in various 
real-world data such as collections of GPS traces collected by vehicles on a road network, earthquakes 
distributions that concentrate around geological faults, distributions of galaxies in the universe, networks 
of blood vessels in anatomy or hydrographic networks in geography just to name a few of them. It is thus 
appealing to try to capture such filamentary structures and to approximate the data by metric graphs that 
will summarize the metric and allow convenient visualization. 
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Contribution In this paper we address the metric reconstruction problem for filamentary structures. 
The input of our method and algorithm is a metric space (X, dx) that is assumed to be close with 
respect to the so-called Gromov-Hausdorff distance dcH to a much simpler, but unknown, metric graph 
(G' 3 dQ>). Our algorithm outputs a metric graph (G, da) that is proven to be close to (X, dx)- Our 
approach relies on the notion of Reeb graph (and some variants of it introduced in Section 3.1) and one 
of our main theoretical result can be stated as follows. 

Theorem 3.10. Let (X, dx) be a compact connected geodesic space, let r E X be a fixed base point 
such that the metric Reeb graph (G, do) of the function d — dx(r, .) : X — >► R is a finite graph. If for a 
given e > there exists a finite metric graph (G 7 , dc) such that dcH {X, G f ) < e then we have 

d GH (X, G) < 203i(G) + 1)(17 + 8N Efi ,(8e))e 

where Ne,g'(& s ) is the number of edges of G 1 of length at most 8e and /?i(G) is the first Betti number 
ofG, i.e. the number of edges to remove from G to get a spanning tree. In particular if X is at distance 
less than efrom a metric graph with shortest edge larger than 8e then dcH (X, G) < 34(/?i(G) + l)e. 

Turning this result into a practical algorithm requires to address two issues: 

- First, raw data usually do not come as geodesic spaces. They are given as discrete sets of point 
(and thus not connected metric spaces) sampled from the underlying space (X, dx)- Moreover in 
many cases only distances between nearby points are known. A geodesic space (see Section 2 for 
a definition of geodesic space) can then be obtained from these raw data as a neighborhood graph 
where nearby points are connected by edges whose length is equal to their pairwise distance. The 
shortest path distance in this graph is then used as the metric. In our experiments we use this new 
metric as the input of our algorithm. The question of the approximation of the metric on X by the 
metric induced on the neighborhood graphs is out of the scope of this paper. 

- Second, approximating the Reeb graph (G, do) from a neighborhood graph is usually not obvious. 
If we compute the Reeb graph of the distance function to a given point defined on the neighbor- 
hood graph we obtain the neighborhood graph itself and do not achieve our goal of representing 
the input data by a simple graph. It is then appealing to build a two dimensional complex having 
the neighborhood graph as 1-dimensional skeleton and use the algorithm of [27, 31] to compute 
the Reeb graph of the distance to the root point. Unfortunately adding triangles to the neighbor- 
hood graph may widely change the metric between the data points on the resulting complex and 
significantly increase the complexity of the algorithm. We overcome this issue by introducing a 
variant of the Reeb graph, the a-Reeb graph, inspired from [33] and related to the recently intro- 
duced notion of graph induced complex [14], that is easier to compute than the Reeb graph but 
also comes with approximation guarantees (see Theorem 3.11). As a consequence our algorithm 
relies on the Mapper algorithm of [33] and runs in almost linear time (see Section 4). 

Related work. Approximation of data by 1-dimensional geometric structures has been considered by 
different communities. In statistics, several approaches have been proposed to address the problem of 
detection and extraction of filamentary structures in point cloud data. For example Arial-Castro et al 
[4] use multiscale anisotropic strips to detect linear structure while [24, 23] and more recently [25] base 
their approach upon density gradient descents or medial axis techniques. These methods apply to data 
corrupted by outliers embedded in Euclidean spaces and focus on the inference of individual filaments 
without focus on the global geometric structure of the filaments network. 

In computational geometry, the curve reconstruction problem from points sampled on a curve in 
an euclidean space has been extensively studied and several efficient algorithms have been proposed 
[3, 16, 17]. Unfortunately, these methods restricts to the case of simple embedded curves (without 
singularities or self-intersections) and hardly extend to the case of topological graphs. In a more intrinsic 
setting where data come as finite abstract metric spaces, [1] propose an algorithm that outputs, under 
some specific sampling conditions, a topologically correct (up to a homeomorphism) reconstruction of 
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the approximated graph. However this algorithm requires some tedious parameters tuning and relies 
on quite restrictive sampling assumptions. When these conditions are not satisfied, the algorithm may 
fail and not even outputs a graph. Compared to the algorithm of [1], our algorithm not only comes 
with metric guarantees but also whatever the input data is, it always outputs a metric graph and does not 
require the user to choose any parameters. Our approach is also related to the so-called Mapper algorithm 
[33] that provides a way to visualize data sets endowed with a real valued function as a graph. Indeed 
the implementation of our algorithm relies on the Mapper algorithm where the considered function is the 
distance to the chosen root point. However, unlike the general mapper algorithm, our methods provides 
an upper bound on the Gromov-Hausdorff distance between the reconstructed graph and the underlying 
space from which the data points have been sampled. 

In theoretical computer science, there is much of work on approximating metric spaces using trees 
[7, 2, 12] or distribution of trees [18, 22] where the trees are often constructed as spanning trees possibly 
with Steiner points. Our approach is different as our reconstructed graph or tree is a quotient space of 
the original metric space where the metric only gets contracted (see Lemma 3.6). Finally we remark 
that the recovery of filament structure is also studied in various applied settings, including road networks 
[11, 35], galaxies distributions [13]. 

The paper is organized as follows. The basic notions and definitions used throughout the paper are 
recalled in Section 2. The Reeb and a-Reeb graphs endowed with a natural metric are introduced in 
Section 3.1 and the approximation results are stated and proven in Sections 3.2 and 3.3. Our algorithm 
is described in Section 4 and experimental results are presented and discussed in Section 5. 

2 Preliminaries: metric spaces, metric graphs and Gromov-Hausdorff 
distance 

Recall that a metric space is a pair (X, dx) where X is a set and dx ' X x X — » R is a non negative 
map such that for any x,y,z E X, dx(x,y) = if and only if x = y, dx(x,y) = dx(y,x) and 
dx(x, z) < dx(x, y) + dx(y, z). Two compact spaces (X, dx) and (Y, dy) are isometric if there exits 
a bijection ip : X —> Y that preserves the distances, namely: for any x,x' E X, dy(cp(x), (p(x')) = 
dx(x, x r ). The set of isometry classes of compact metric spaces can be endowed with a metric, the so- 
called Gromov-Hausdorff distance that can be defined using the following notion of correspondence ([6] 
Def. 7.3.17). 

Definition 2.1. Let (X, dx) and (Y, dy) be two compact metric spaces. Given e > 0, an e-correspondence 
between (X, dx) and (Y, dy) is a subset C C X x Y such that: 

i) for any x E X there exists y E Y such that (x,y) E C; 

ii) for any y E Y there exists x E X such that (x,y) E C; 

iii) for any (x,y), {x' ,y') E C, \d x {x,x') - dy(y,y')\ < s. 

Definition 2.2. The Gromov-Hausdorff distance between two compact metric spaces (X, dx) and ( Y, dy) 
is defined by 

dcH^X, Y) = ^ inf {e > : there exists an ^-correspondence between X and Y} 

A metric space (X, dx) is & path metric space if the distance between any pair of points is equal 
to the infimum of the lengths of the continuous curves joining them l . Equivalently (X,dx) is a 
path metric space if and only if for any x, y E X and any e > there exists z E X such that 
max((ix(^ 5 2), dx(y, z)) < \dx(x, y) + e [26]. In the sequel of the paper we consider compact path 
metric spaces. It follows from the Hopf-Rinow theorem (see [26] p. 9) that such spaces are geodesic, i.e. 

1 see [26] Chap. 1 for the definition of the length of a continuous curve in a general metric space 
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for any pair of point x, x' E X there exists a minimizing geodesic joining them. 2 A continuous path 
5 : / —> X where / is a real interval or the unit circle is said to be simple if it is not self intersecting, i.e. 
if S is an injective map. 

Recall that a (finite) topological graph G = (V, E) is the geometric realisation of a (finite) 1- 
dimensional simplicial complex with vertex set V and edge set E. If moreover each 1-simplex e E E 
is a metric edge, i.e. e = [a, 6] C R, then the graph G inherits from a metric do which is the unique 
one whose restriction to any e = [a, 6] E E coincides with the standard metric on the real segment [a, b]. 
Then (G, g?g<) is a metric graph (see [6], Section 3.2.2 for a more formal definition). Intuitively, a metric 
graph can be seen as a topological graph with a length assigned to each of its edges. 

The first Betti number /3i(G) of a finite topological graph G is the rank of the first homology group 
of G (with coefficient in a field, e.g. Z/2), or equivalently, the number of edges to remove from G to get 
a spanning tree. 

3 Approximation of path metric spaces with Reeb-like graphs 

Let (X, dx) be a compact geodesic space and let r E X be a fixed base point. Let d : X R be the 
distance function to r, i.e., d(x) = dx(r, x). 

3.1 The Reeb and a-Reeb graphs of d 

The Reeb graph. The relation x ~ y if and only if = and x, y are in the same path connected 
component of d~ 1 (d(x)) is an equivalence relation. The quotient space G = Xj ~ is called the Reeb 
graph of d and we denote by tt : X — >► G the quotient map. Notice that 7r is continuous and as X is 
path connected, G is path connected. The function d induces a function d* : G — > that satisfies 
d = o 7T. The relation defined by: for any g, g' E G, g <g p' if and only if d*(g) < and there 

exist a continuous path 7 in G connecting g to such that d o 7 is non decreasing, makes G a partially 
ordered set. 

The a-Reeb graphs. Computing or approximating the Reeb graph of (X, d) from a finite set of point 
sampled on X is usually a difficult task. To overcome this issue we also consider a variant of the Reeb 
graph that shares very similar properties than the Reeb graph. Let a > and let 1 = {Ii}i E / be a 
covering of the range of d by open intervals of length at most a. The transitive closure of the relation 
x r^ a y if and only if d(x) = d(y) and x,y are in the same path connected component of 
for some interval U E X is an equivalence relation that is also denoted by The quotient space 
G a = Xj r^ a is called the a-Reeb graph 3 of d and we denote by tt : X —> G a the quotient map. Notice 
that 7r is continuous and as X is path connected, G a is path connected. The function d induces a function 
d* : G a —> R + that satisfies d = d* tt. The relation defined by: for any g, g f E G a , p (/ if and 
only if (g) < and there exist a continuous path 7 in G a connecting g to (/ such that d o 7 is non 

decreasing, makes G a a partially ordered set. 

The a-Reeb graph is closely related to the graph constructed by the Mapper algorithm introduced in 
[33] making its computation much more easier than the Reeb graph (see Section 4). 

Notice that without making assumptions on X and d, in general G and G a are not finite graphs. However 
when the number of path connected components of the level sets of d is finite and changes only a finite 
number of times then the Reeb graph turns out to be a finite directed acyclic graph. Similarly, when the 
covering of X by the connected components of d _1 (/^), i E 1 is finite, the a-Reeb graph also turns out 
to be a finite directed acyclic graph. This happens in most applications and for example when (X, dx) is 

2 recall that a minimizing geodesic in X is any curve 7 : I — > X, where I is a real interval, such that dx{^f{t)^{t')) = 
\t - t'\ for any t,t x G /. 

3 strictly speaking we should call it the a-Reeb graph associated to the covering X but we assume in the sequel that some 
covering X has been chosen and we omit it in notations 



4 



a finite simplicial complex or a compact semialgebraic (or more generally a compact subanalytic space) 
with d being semi-algebraic (or subanalytic). 

All the results and proofs presented in Section 3 are exactly the same for the Reeb and the a-Reeb graphs. 
In the following paragraph and in Section 3.2, G denotes indifferently the Reeb graph or an a-Reeb graph 
for some a > 0. We also always assume that X and d (and a and T) are such that G is a finite graph. 

A metric on Reeb and a-Reeb graphs. Let define the set of vertices V of G as the union of the set of 
points of degree not equal to 2 with the set of local maxima of d* over G, and the base point 7r(r). The 
set of edges E of G is then the set of the connected components of the complement of V. Notice that 
7r(r) is the only local (and global) minimum of d*: since X is path connected, for any x E X there exists 
a geodesic 7 joining r to x along which d is increasing; d* is thus also increasing along the continuous 
curve 71(7), so tt(x) cannot be a local minimum of d*. As a consequence d* is monotonous along the 
edges of G. We can thus assign an orientation to each edge: if e = [p, q] E G is such that d*(p) < d*(q) 
then the positive orientation of e is the one pointing from p to g. Finally, we assign a metric to G. Each 
edge e E £"is homeomorphic to an interval to which we assign a length equal to the absolute difference of 
the function d* at two endpoints. The distance between two points p' of e is then |d*(p) — d*(p f )\. This 
makes G a metric graph (G, d^) isometric to the quotient space of the union of the intervals isometric 
to the edges by identifying the endpoints if they correspond to the same vertex in G. See Figure 1 for an 
example. Note that d* is continuous in (G, do) and for any p E G, d*(p) = dc(7r(r),p). Indeed this is 
a consequence of the following lemma. 

Lemma 3.1. If S is a path joining two points p,p r E G such that d* o 5 is strictly increasing then 5 is a 
shortest path between p and p' and do (p, p f ) — d* (p f ) — d* (p). 

Proo/ As d* o 5 is strictly increasing, when 5 enters an edge e by one of its end points, either it exits 
at the other end point or it stops at p' if p' E e. Moreover 5 cannot go through a given edge more than 
one time. As a consequence 5 can be decomposed in a finite sequence of pieces eo = [p,Pi],ei = 
[pi?P2], * * * , = [pn-ijPn]? e n = \p n ,p'\ where e and e n are the segments joining p and p' to 
one of the endpoint of the edges that contain them and ei, • • ■ , e n _i are edges. So, the length of S 
is equal to (d*(pi) - d*(p)) + (d*(p 2 ) - d*(pi)) + • • • + (d*(p') - d*(Pn)) = c?*(p 7 ) ~ c?*(p) and 
dc{p,P f ) < d*(pO - d*(p). 

Similarly any simple path joining p to can be decomposed in a finite sequence of pieces e f = 

[P,K] 3 e i = [Pi^] 3 ■ • • 3 4-1 = bfc-i^^4 = [PfcjP 7 ] where e o and e fc are the segments joining p 
and p' to one of the endpoint of the edges that contain them, and e[, • • ■ , e / fc _ 1 are edges. Now, as we do 
not know that d* is increasing along this path, its length is thus equal to \d*(p[) — d*{p) \ + \d*(p f 2 ) — 
d*(p[)\ + • • • + \d*(p f ) ~ rf*(Pn)| > d,(p f ) - dk(p). So, d G (p,p f ) > d*(p f ) - d*{p). ~ □ 



e(v 8 v 9 ) 

l isometric to 
\d*{v9) - d*(v$)\ 



Figure 1: Construction of a Reeb graph. 
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3.2 Bounding the Gromov-Hausdorff distance between X and G 

The goal of this section is to provide an upper bound of the Gromov-Hausdorff distance between X and 
G that only depends on the first Betti number /3i(G) of G and the maximal diameter M of the level sets 
of 7T. An upper bound of M is given in the next section. 

Theorem 3.2. <1gh(X,G) < (Pi(G) + 1)M where (Igh(X,G) is the Gromov-Hausdorff distance 
between X and G, fi\(G) is the first Betti number of G and M — sup peG {diameter(7T~ 1 (p))} is the 
supremum of the diameters of the level sets ofir. 

The proof of Theorem 3.2 will be easily deduced from a sequence of technical lemmas that al- 
low to compare the distance between pair of points x, y E X and the distance between their images 

n(x),7r(y) E G 

A vertex v E V is called a merging vertex if it is the end point of at least two edges e\ and e2 that are 
pointing to it according to the orientation defined in Section 3.1. Geometrically this means that there are 
at least two distinct connected components of 7r _1 (d" 1 (d*(v) — e)) that accumulate to 7r -1 (?;) as e > 
goes to 0. The set of merging vertices is denoted by V m . 

The following lemma provides an upper bound on the number of vertices in V m . 

Lemma 3.3. The number of elements in V m is equal to Pi(G) where 0i(G) is the first homology group 
ofG. 

Proof The result follows from classical homology persistence theory [20]. First remark that, as 7r(r) is 
the only local minimum of d*, the sublevel sets of the function d* : G — >> R + are all path connected. 
Indeed if 7r(x),7r(y) E G are in the same sublevel set d~ 1 ([0,a]), a > 0, then the images by tt of 
the shortest paths in X connecting x to r and y to r are contained in d^ 1 ([0, a]) and their union is a 
continuous path joining tt(x) to ir(y). As a consequence, the 0-dimensional persistence of d* is trivial. 
So, for the persistence algorithm applied to d*, the vertices of V m are the positive simplices (see [21] for 
the notion of positive and negative simplices for persistence). It follows that each vertex of V m creates 
a cycle that never dies as G is one dimensional and does not contain any 2-dimensional simplex. Thus 
\V m \=p 1 (G). □ 

Lemma 3.4. Letp^p' E G and let 8 : [d*(p), d*(p')] — )► G be a strictly increasing path going from p to p f 
that does not contain any point of V m in its interior. Then for any x r E 7r -1 (y)nc/(7r -1 (5(d*(p), d*(p'))) 
where cl{.) denotes the closure, there exists a shortest path 7 connecting a point x of7r~ 1 (p) to x' such 
thatn^) = 5 and dx(x,x f ) = d(V) — d(x) = d*(p f ) — d*(p) = dG(p,p')- 

Notice that from Lemma 3. 1, 6 is a shortest path and the parametrization by the interval [d* (p) , d* (p 7 )] 
can be chosen to be an isometric embedding. 

Proof First assume that p' is not a merging point. Let 70 : [0, d{x')] — » X be any shortest path between 
r and a/ and let 7 be the restriction of 70 to [d*(p), d(x 7 )] = [d*(p), d*^')]. If the infimum to of the set 
I = {£ E [d*(p), d*(p 7 )] : 7r(7(t / )) E 5, Vt 7 > t} is larger than d*(p), then 7r(7(to)) then there exists an 
increasing sequence (t n ) that converges to to such that j(t n ) S. As a consequence 5 (to) is a merging 
point; a contradiction. So to = d*(p) and 7(d*(p)) intersects 7r _1 (p) at a point x. 

Now if y is a merging point, as x' is chosen in the closure of 7r _1 (5(d* (p) , d* (p 7 ) ) , for any sufficiently 
large n E N one can consider a sequence of points E 7r _1 (5(d >t; (p / ) — 1/n)) that converges to x r and 
apply the first case to get a sequence of shortest path 7^ from a point x n E 7r -1 (p) and x^. Then 
applying Arzela-Ascoli's theorem (see [19] 7.5) we can extract from j n a sequence of points converging 
to a shortest path 7 between a point x E 7r _1 (p) and x 7 . 

To conclude the proof, notice that from Lemma 3.1 we have dcf(p,p / ) = d^p 7 ) — d*(p) = d(x 7 ) — 
d(x). Since 7 is the restriction of a shortest path from r to x we also have dx(x, x') = d{x') — d(x). □ 

Lemma 3.5 and Lemma 3.6 allow to compare the metrics dx and d^. 
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Lemma 3.5. For any x,y E X we have 

d x (x,y) ^ d G (ir(x),ir(y)) + 2(^(G) + 1)M 

where 0i(G) is the first Betti number ofG and M — sup peG {diameter(7v~ 1 (p))}. 

Proof. Let S be a shortest path between tt(x) and 7r(y). Remark that except at the points tt(x) and 7r(y) 
the local maxima of the restriction of d* to S are in V m . Indeed as S is a shortest path it has to be simple, 
so if p E S is a local maximum then p has to be a vertex and S has to pass through two edges having p as 
end point and pointing to p according to the orientation defined in Section 3.1. So p is a merging point. 
Since S is simple and V m is finite, S can be decomposed in at most | V m | + 1 connected paths along the 
interior of which the restriction of d* does not have any local maxima. So along each of these connected 
paths the restriction of d* can have at most one local minimum. As a consequence, S can be decomposed 
in a finite number of continuous paths 5i, 62, • • • ,5k with k ^ 2(\V m \ + 1), such that the restriction of 
d* to each of these path is strictly monotonous. For any i E {1, • • • , k} let pi and pi+\ the end points 
of Si with pi = 7r(x) and Pk+i = ?r(y). We can apply Lemma 3.4 to each Si to get a shortest path 
7^ in X between a point E tt _1 (^) and a point in y^ + i E 7r _1 (^ + i) such that 7r(7^) = Si and 
dx{xi, j/i+i) = dc(pi,Pi+i)> The sum of the lengths of the paths 7^ is equal to the sum of the lengths 
of the path Si which is itself equal to d G {ii{x) 1 Ti{y)). Now for any i E {1, • • • , k}, since 7r(xi) = ir(yi) 
we have dx(%i, y%) ^ M and Xi and y^ can be connected by a path of length at most M (x\ is connected 
to x and yk+i is connected to y. Gluing these paths to the paths 7^ gives a continuous path from x to y 
whose length is at most d<3(7r(a;), 7r(y)) + fcM < d G (7r(x), n(y)) + 2(|V^| + 1)M. Since from Lemma 
3.3, | V m | < we finally get that d x (x, y) < d G (^(x), ^(y)) + 2(/3(G) + 1)M. □ 

Lemma 3.6. 77ze map ir : X G is 1-Lipschitz: for any x,y £ X we have 

d G (7r(x),7r(y)) < d x {x,y). 

Proof Let x, y E X and let 7 : / — >► X be a shortest path from x to y in X where / C R is a closed 
interval. The path ^(7) connects tt(x) and 7r(y) in G. 

We first claim that there exists a continuous path T contained in ^(7) connecting tt(x) and n(y) that 
intersects each vertex of G at most one time. The path T can be defined by iteration in the following way. 
Let ui, • • • v n E V be the vertices of G that are contained in ^(7) \ {tt(x), 7r(y)} and let Tq = ^(7) : 
Jo = J — > G. For i = 1, • • • n let = inf{£ : IVi(t) = Vi} and = sup{£ : = ^} and 

define T^ as the restriction of IV 1 to = J^_i \ The path T^ is a connected continuous path 

(although is a disjoint union of intervals) that intersects the vertices vi,V2, — m > ^ at most one time. 
We then define T = T n : J = J n —> G where J C I is a finite union of closed intervals. Notice that V 
is the image by tt of the restriction of 7 to J and that T(t) E • • • v n } only if t is one of the endpoints 
of the closed intervals defining J. 

Now, for each connected component [£, t'} of J, 7((t, t 7 )) is contained in 7r _1 (e) where e is the edge 
of G containing r([t, t 7 ]). As a consequence, d<3(7r(7)(£), 7r(7)(£ 7 )) = |d*(7r(7)(£) — d*(7r(7)(£ 7 ))| = 
\d(j(t)) - d(7(t 7 ))|. Recalling that d(j(t)) = d x (r,j(t)) and d(7(t 7 )) = d x (r,7(t 7 )) and using the 
triangle inequality we get that \d(^(t)) — d(^(t f ))\ < dx{^{t)^{t')). To conclude the proof, since 7 is 
a geodesic path we just need to sum up the previous inequality over all connected components of J: 

d x (x,y) > ^( 7 (t), 7 (t 7 )) > d G (7r( 7 )(t),7r( 7 )(t f )) > d G (<7r(x), 7r(y)) 

[*,*'] Gcc(J) £cc( J) 

where cc(J) is the set of connected components of J. □ 

The proof of Theorem 3.2 now easily follows from Lemmas 3.5 and 3.6. 

Proof (of Theorem 3.2) Consider the set C {(x,7r(x)) : x E X} C X x G. As tt is surjec- 
tive this is a correspondence between X and G. It follows from Lemmas 3.5 and 3.6 that for any 

(x,7r(x)),(y,7r(y)) E C, 

\d x (x, y) - d G (7r(*),7r(y))| < 2(/3i (G) + 1)M 
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where f3 1 (G) is the first Betti number of G and M = sup pGG {diameter(7r _1 (p))}. So C is a (2(/3i(G) + 
l)M-correspondence and d GH (X, G) < + 1)M. □ 



3.3 Bounding M 

To upperbound the diameter of the level sets of tt we first prove the two following general lemmas. 

Lemma 3.7. Let (G, do) be a connected finite metric graph and let r E G We denote by d r — dc(r, .) : 
G — > [0, +oo) the distance to r. For any edge E C G, the restriction of d r to e is either strictly 
monotonous or has only one local maximum. Moreover the length I — 1(E) of E is upper bounded by 
two times the difference between the maximum and the minimum of d r restricted to E. 

Proof. Let / be the length of E and let t \-> e(t), t E [0, /], be an arc length parametrization of E. Since 
E is an edge of G, for t E [0, /] any shortest geodesic jt joining r to e(t) must contain either x\ = e(0) 
or X2 = e(l). If it contains x\ then for any t f < t the restriction of j t between r and e{t') is a shortest 
geodesic containing x\ and if it contains X2 then for any t! > t the restriction of j t between r and e(t') is 
a shortest geodesic containing X2. Moreover in both cases, the function d r is strictly monotonous along 
7. As a consequence, the set I\ = {t E [0, 1] : a shortest geodesic joining r to e(t) contains x\] is a 
closed interval containing 0. Similarly the set I2 = {t E [0, /] : a shortest geodesic joining r to e(t) 
contains X2} is a closed interval containing / and [0, 1] = I\ U I2. Moreover d r is strictly monotonous 
on e(ii) and on e(/2)- As a consequence /1 D ^2 is reduced to a single point to that has to be the unique 
local maximum of d r restricted to E. 

The second part of the lemma follows easily from the previous proof: the minimum of d r restricted 
to E is attained either at x\ or X2 and d r (e(to)) = d r {x\) + to = d r (x2) + I — to is the maximum of d r 
restricted to E. We thus obtain that 2to = I + (d r (x2) — d r (x\)). As a consequence if d r (x\) < d r (x2) 
then Z/2 ^ to = d r (e(to)) — similarly if d r (xi) ^ d r (x2) then Z/2 ^ Z — to = d r (e(to)) — 

d r {x2). n 

Lemma 3.8. Let (G, do) be a connected finite metric graph and let r E G. For a > we denote by 
Ne(ol) the number of edges ofG of length at most a. For any d > and any connected component B of 
the set Bd,a — { x £ G : d — a < dciy, x) ^ d + a} we have 

diam(B) < 4(2 + N E (4a))a 

The example of figure 2 shows that the bound of Lemma 3.8 is tight. 




Figure 2: Tightness of the bound in Lemma 3.8: there are 3 edges of length at most 4a: and the diameter 
of B is equal to 20a. 

Proof. Let x, y E B and let £ \-> j(t) E 5 be a continuous path joining x to y in Let be an edge of 
G that does not contain x or y and with end points xi, X2 such that 7 intersects the interior of E. Then 
7 -1 (i?) is a disjoint union of closed intervals of the form / = [£, t'] where j(t) and ^(t') belong to the 
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set {xi, X2}. If j(t) = 7(t 7 ) we can remove the part of 7 between t and t' and still get a continuous 
path between x and y. So without loss of generality we can assume that if 7 intersects the interior of 
E, then E is contained in 7. Using the same argument as previously we can also assume that if 7 goes 
across E, it only does it one time, i.e. 7 _1 (£ 1 ) is reduced to only one interval. As a consequence, 7 can 
be decomposed in a sequence [x, vq], E\, E<i, E^ [v^, y] where [x, vo] and [i^, y] are pieces of edges 
containing x and y respectively and E\ = [vq, vi], E2 = i^]-, = [^jfe-i ? ^fc] are pairwise distinct 
edges of G contained in B. It follows from Lemma 3.7 that the lengths of the edges E\,- • • E\~ and 
of [x, uq] and y] are upper bounded by 4<x As a consequence the length of 7 is upper bounded by 
4(fc + 2)a which is itself upper bounded by 4(A^(4a:) + 2) a since the edges E\, - • - are pairwise 
distinct. It follows that d G {x, y) < 4(N E (4a) + 2)a. □ 

Theorem 3.9. Let (G, d^*) ^ connected finite metric graph and let (X, dx) ^ compact geodesic 
metric space such that dcii(X, G) < e for some e > 0. Let xo E X be a fixed point and let d XQ — 
dx(xo, .) : X — >► [0, +00) Z?e //ze distance function to xo- Then for d ^ a ^ the diameter of any 
connected component L ofd~^([d — a, d + a]) satisfies 

diam(L) < 4(2 + N E (4(a + 2e)))(a + 2e) + e 

where A^^(4(a + 2s)) w //ze number of edges ofG of length at most 4(a + 2s). 7n particular if a — 
and 8£ w smaller that the length of the shortest edge of G then the diameter of L is upper bounded by 
lie. 

Proof Let e' > be such that d<3# (X, G) < e' < e. Let CcIxGbean ^-correspondence between 
X and G and (xo, r) E G. we denote by d r = dc(r, •) : G —> [0, +00) the distance function to r in G. 
Let x a , X5 E L and let (x a , y a ), (x^, y&) E G. There exists a continuous path 7 C L joining x a to X5. 
Since G is an ^-correspondence for any x E 7 there exists a point (x, y) E C such that d — a — £ 7 < 
d r (y) < d+a+£ 7 . The set of points y obtained in this way is not necessarily a continuous path from y a to 
y&. However one can consider a finite sequence x\ = x a , X2, • • • , x n = X5 of points in 7 such that for any 
i = 1, • • • n—1 we have dx(%i, 2^+1) < £— £ 7 . If j/i) E C then we have dc(yi, y^+i) < £— £ 7 +£ 7 = £. 
As a consequence, since d — a — e < d — a — e' < d r {yi) < d + a + s f < d + a + £ the shortest 
geodesic connecting yi to yi+i in G remains in the set d" 1 ( [d — a — 2e, d + a + 2e] ) and connecting these 
geodesies for alii = 1, • • • , n — 1 we get a continuous path from y a to y^ in d~ 1 ([d — a — 2e, d+a J r 2e\). 
It then follows from Proposition 3.8 that dc(y a , y b ) 4(2 + A^(4(a + 2s))) (a + 2e) and since C is 
an ^-correspondence (and so an ^-correspondence), dx{x ai X5) < 4(2 + A^(4(a + 2e)))(a + 2s) + e. 

□ 

As a corollary of the Theorem 3.9 and Theorem 3.2 we immediately obtain the following results for 
the Reeb graph and the a-Reeb graphs. 

Theorem 3.10. Let (X, dx) be a compact connected path metric space, let r E X be a fixed base point 
such that the metric Reeb graph (G, do) of the function d — dx(r, .) : X — >► R w a finite graph. If for a 
given e > //zere gjew^ a finite metric graph (G 7 , d^/) ^wc/z //za/ dcH(X, G f ) < e then we have 

d GH (X,G) < (/3i(G) + l)(17 + 8W EiG ,(8e))e 

where Ne,g' (8^) ^ number of edges ofG 1 of length at most 8s. In particular if X is at distance less 
than efrom a metric graph with shortest edge length larger than 8e then dcH (X, G) < 17(/?i (G) + 1)£. 

Theorem 3.11. Let (X, dx) be a compact connected path metric space. Let r E X, a > and I be a 
finite covering of the segment [0,Diam(X)] by open intervals of length at most a such that the a-Reeb 
graph G a associated to I and the function d — dx{r, .) :l4Rw a finite graph. If for a given e > 
there exists a finite metric graph (G 7 , dc>) such that dcH(X, G 7 ) < e then we have 

d GH (X, Go) < (Pi(G a ) + 1)(4(2 + N E , G ,(4(a + 2e)))(a + 2e) + e) 

where NE,G'(^( a + 2s)) w ^ number of edges ofG' of length at most 4(a + 2s). particular if 
X is at distance less than e from a metric graph with shortest edge length larger than 4 (a + 2e) then 
(Igh(X, G a ) < (Pi(G a ) + l)(8a + 17s). 
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4 Algorithm 



In this section, we describe an algorithm for computing a-Reeb graph for some a > 0. We assume the 
input of the algorithm is a neighboring graph H = (V, E), a function I : E —> R + specifying the edge 
length and a parameter a. In the applications where the input is given as a set of points together with 
pairwise distances, i.e., a finite metric space, one can generate the neighboring graph H as a Rips graph 
of the input points with the parameter chosen as a fraction of a. 

Our algorithm can be described as follows. We assume H is connected as one can apply the algorithm 
to each connected component if H is not . Figure 3 illustrates the different steps of the algorithm. In 
the first step, we fix a node of H as the root r and then obtain the distance function d : V — » R + by 
computing d(y) as the graph distance from the node v to r. In the second step, we apply the Mapper 
algorithm [33] to the nodes V with filter d to construct a graph G. Specifically, let X = {(ia, (i + 
l)a), ((i + 0.5)a, (i + 1.5) a) |0 < i < m} so that Uj ke xlk covers the range of the function d. We say an 
interval J/ Cl £ X is lower than another interval £ X if the midpoint of 7/ Cl is smaller than that of . 
Now let Hj, be the subgraph of H restricted to = Namely two nodes in are connected 

with an edge if they are in H. Notice that each subgraph may have several connected components, 
which can be listed in an arbitrary order. Denote H l k the l-th connected component of and V£ its set 
of nodes. Think of {V^}^ as a cover of V. Then the graph G constructed by the Mapper algorithm is 
the 1-skeleton of the nerve of that cover. Namely, each node in G represents an element in {V^}/^, i.e., 
a subset of nodes in V. Two nodes V^ 1 and Vj^ are connected with an edge if V^ 1 fl ^ 0. 

In the final step, we represent each node V£ in G using a copy of the interval As mentioned 
in the Section 3.1, a-Reeb graph is a quotient space of the disjoint union of those copies of intervals. 
Specifically, for an edge in G, let Vj} and Vj£ be its endpoints. Then 7/ Cl and must be partially 
overlapped. We identify the overlap part of these two intervals. After identifying the overlapped intervals 
for all edges in G, the resulting quotient space is the a-Reeb graph. Algorithmically, the identification is 
performed as follows. We split each copy of internal into two by adding a point in the middle. Now 
think of it as a graph with two edges and label one of them upper and the other lower. Notice that two 
overlapped intervals 7/ Cl and can not be exactly the same. One must be lower than the other. To 
identify their overlapped part, we identify the upper edge of the lower interval with the lower edge of the 
upper interval. 

The time complexity of the above algorithm is dominated by the computation of the distance function 
in the first step, which is 0(\E\ + \V\ log \ V\). The computation of the connected components in the 
second step is 0(\V\ log \ V\) based on union-find data structure. In the final step, there are at most 
0(|y|) number of the copies of the intervals. Based on union-find data structure, the identification can 
also be performed in 0(\V\ log \ V\) time. 

5 Experiments 

In this section, we apply our algorithm to a few data sets. The first data set is that of earthquake loca- 
tions through which we wish to learn the geometric information about earthquake faults. The raw data 
was obtained from USGS Earthquake Search [32] and consists of earthquakes between 01/01/1970 and 
01/01/2010, of magnitude greater than 5.0, and of location in the rectangular area between latitudes -75 
degrees and 75 degrees and longitude between -170 degrees and 10 degrees. The raw earthquake data 
set contains the coordinates of the epicenters of 12790 earthquakes. We follow the procedure described 
in [1] to remove outliers and randomly sampled 1600 landmarks. Finally, we computed a neighboring 
graph from these landmarks with parameter 4. The length of an edge in this graph is the Euclidean dis- 
tance between its endpoints. For each connected component, we fix a root point and compute the graph 
distance function d to the root point as shown in Figure 4(a). Set a also equals 4 and apply our algorithm 
to the above data to obtain the a-Reeb graph. In general a-Reeb graph is an abstract metric graph. In this 
example, for the purpose of visualization, we use the coordinates of the landmarks to embed the graph 
into the plane as follows. Recall that for a copy of interval representing the node V k in G, we split 
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q disjoint union of c^-Reeb graph 

copies of intervals 



Figure 3: Illustration of the different steps of the algorithm for computing a-Reeb graph. In the disjoint 
union of copies of intervals, the subintervals marked with same labels are identified in the a-Reeb graph. 



it into two by adding a point in the middle. We embed the endpoints of the interval to the landmarks of 
the minimum and the maximum of the funciton d in V^, and the point in the middle to the landmark of 
the median of the function d in V^. Figure 4(b) shows the embedding of the a-Reeb graph. Note this 
embedding may introduce metric distortion, i.e., the Euclidean length of the edge may not reflect the 
length of the corresponding edge in the a-Reeb graph. 

The second data set is that of 500 GPS traces tagged "Moscow" from OpenStreetMap [30]. Since cars 
move on roads, we expect the locations of cars to provide information about the metric graph structure 
of the Moscow road network. We first selected a metric e-net on the raw GPS locations with e = 
0.0001 using furthest point sampling. Then, we computed a neighboring graph from the samples with 
parameter 0.0004. Again for each connected component, we fix a root point and compute the graph 
distance function d to the root point as shown in Figure 5(a). Set a also equals 0.0004 and compute the 
a-Reeb graph. Again, we use the same method as above to embed the a-Reeb graph into the plane, as 
shown in Figure 5(b). 

To evaluate the quality of our a-Reeb graph for each data set, we computed both original pairwise 
distances, and pairwise distances approximated from the constructed a-Reeb graph. For GPS traces, we 
randomly select 100 points as the data set is too big to compute all pairwise distances. We also evaluated 
the use of a-Reeb graph to speed up distance computations by showing reductions in computation time. 
Only pairs of points in the same connected component are included because we obtain zero error for 
the pairs of vertices that are not. Statistics for the size of the reconstructed graph, error of approximate 
distances, and reduction in computation time are given in Table 1. 

Road network is directional. There are one-way streets. In fact, roads are often split so that cars in 
different directions run in different lanes. In particular, this is the true for highways. In addition, when 
two roads cross in GPS coordinates, they may bypass through a tunnel or an evaluated bridge and thus 
the road network itself may not cross. Since in most circumstances, drivers follow the road network and 
do not drive against the traffic, such directional information is contained in the GPS traces. Here we 
encode the directional information by stacking several consecutive GPS coordinates to form a point in 
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a higher dimensional space. In this way, we obtain a new set of points in this higher dimension space. 
Then we build a neighboring graph for this new set of points based on L2 norm and apply our algorithm 
to recover the road network. In particular, although the paths intersect at the cross in GPS cooridnates, 
the roadnet work does not. 

We first test the above strategy on a synthetic dataset, where we simulate a car driving on a highway 
crossing according to the right-driving rule as shown in Figure 6(a). 300 hundred traces are obtained, each 
of which contains 500 points. Stack 10 consecutive points along a trace to form a point in 20-dimensional 
space. In this way, obtain a point set in 20-dimensional space. Build a neighboring graph over this set of 
points where the length of an edge is the Euclidean distance of its endpoints, and compute the distance 
function as the graph distance to an arbitrarily chosen point as shown in Figure 6(b) where the points are 
projected to three space using the 1st, 2nd and 17th coordinates. To visualize the reconstructed a-Reeb 
graph, we choose the landmarks in the same way as the previous examples and then project them using 
the above three coordinates. Figure 6(c) and (d) show the reconstructed graph in two viewpoints. As we 
can see, we recover the road network. 

Next we extract those GPS traces from the above "Moscow" dataset which pass through a highway 
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GPS traces 


Earthquake 


#Original points 


82541 


1600 


#Original edges 


313415 


26996 


#Nodes in a-Reeb graph 


21644 


147 


#Edges in a-Reeb graph 


21554 


137 


Graph reconstruction time 


46.8 


0.32 


Original Dist Comp Time 


15.27 


1.12 


Approx Dist Comp Time 


0.96 


0.01 


Mean distortion 


6.5% 


14.1% 


Median distortion 


5.3% 


12.5% 



Table 1: The graph reconstruction time is the total time of computing distance function and reconstruct- 
ing the graph. The original distance computation time shows the total time of computing these distances 
using the original graph. The approximate distance computation time is the total time to compute ap- 
proximate distances based on the reconstructed a-Reeb graph. All times are in seconds. 

crossing as shown in Figure 7(a). Since GPS records the position based on time, we resample the traces 
so that the distances between any two consecutive samples is the same among all traces. Then we apply 
the above algorithm to the resampled traces. Figure 7(c) and (d) show the reconstructed graph which 
recovers the road network of this highway crossing. 

6 Discussion 

We have proposed a method to approximate path metric spaces using metric graphs with bounded 
Gromov-Hausdorff distortion, and illustrated the performances of our method on a few synthetic and 
real data sets. Here we point out a few possible directions for future work. First, notice that the a-Reeb 
graph is a quotient space where the quotient map is 1-Lipschitz and thus the metric only gets contracted. 
In addition, the distance from a point to the chosen root is exactly preserved. Therefore, one always 
reduces the metric distortion by taking the maximum of the graph metrics of different root points. It is 
interesting to study the strategy of sampling root points to obtain the smallest metric distortion with the 
fixed number of root points. Second, our method in the current form does not recover the topology of the 
underlying metric space. The tools recently developed in persistence homology seem useful for recov- 
ering topology: we provide a preliminary persistence-based result in the Appendix showing that the first 
Betti number of the underlying metric graph can be inferred from the data. On another hand, Reeb graphs 
have recently been used for topological inference purpose in [15]. It would be interesting to combine our 
method to these approaches to also obtained topologically correct reconstruction algorithms. Finally, our 
method is sensitive to the noise. One can of course preprocess the data and remove the noise and then 
apply our algorithm. Nevertheless, it is interesting to see if the algorithm can be improved to handle 
noise. 
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(c) (d) 

Figure 6: (a) A highway crossing and the synthetic traces, (b) The distance function, (c) and (d) The 
reconstructed a-Reeb graph viewed from two perspectives. 
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Appendix 

Getting the first Betti number of a graph from an approximation 

Although our metric graph reconstruction algorithm does not provide topological guarantees, we show 
below that, using persistent topology arguments, that the first Betti number of a graph can be inferred 
from an approximation. 

Recall that given a compact metric space (X, dx) and a real parameter a > 0, the Vietoris-Rips 
complex Rips(X, a) is the simplicial complex with vertex set X and whose simplices are the finite 
subsets of X with diameter at most a: 

a = [xq, xi, • • • , Xk] E Rips(X, a) 44> dx(xi, xj) ^ a forall i, j. 
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Lemma 6.1. Let G be a connected metric graph and let 1(G) be the length of the shortest loop in 
G that is not homologous to 0. For any metric space D such that dcH(G 1 D) < jqI(G) and any 
dGH(G, D) < a < yqI(G), the first Betti number of G is given by 

61(G) = rank(i7i(Rips(L>,a)) #i(Rips(£>, 3a)) 

where the homomorphism between the homology groups is the one induced by the inclusion maps between 
the Rips complexes. 

Proof The proof follows from a result of [28] that relates the homology of the Rips complexes built 
on top of G to the homology of G and a result of [9] that allows to relate the Rips filtration built on 
top of G and D at the homology level. Since G is a geodesic path, it follows from Theorem 3.5 and 
Remark 2), p.179 in [28] that for any a < \l(G), Rips(G, a) and G are homotopy equivalent. More- 
over, from Proposition 3.3 in [28], for any a < a' < \l(G), the homomorphism i?i(Rips(G, a)) —> 
i7i(Rips(G, a')) induced by the inclusion map is an isomorphism. 

Now let C C D x G be an ^-correspondence between D and G where e < ^/(G). According to [9], 
the persistence modules (i7i(Rips(,D, aO)aeR+ an d (i?i(Rips(G, are ^-interleaved. Now let a 

be as in the statement of the lemma and let f3 > be such that /3 + e < a. The ^-interleaving induces the 
following sequence of homomorphisms 

#i(Rips(G,/3)) iTi(Rips(£>,a)) #i(Rips(G, a+e)) -> J3 r i(Rips(Z>,3a)) #i(Rips(G, 3a+e)) 

where the composition of two consecutive homomorphisms is the homomorphism induced by the inclu- 
sion map between the corresponding Rips complexes. As a consequence since 3a J r s < \l(G) the homo- 
morphisms ifi(Rips(G, (3)) -> i7i(Rips(G, a + e)) and #i(Rips(G, a + e)) -> ifi(Rips(G, 3a + e)) 
are isomorphisms of rank 61(G). It follows that the rank of Hi(Rips(D, a)) — >> Hi(Rips(D, 3a)) is 
equal to 61(G). □ 
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